Most recent approaches to monocular 3D human pose estimation rely on Deep Learning. They typically involve regressing from an image to either 3D joint coordinates directly or 2D joint locations from which 3D coordinates are inferred. Both approaches have their strengths and weaknesses and we therefore propose a novel architecture designed to deliver the best of both worlds by performing both simultaneously and fusing the information along the way. At the heart of our framework is a trainable fusion scheme that learns how to fuse the information optimally instead of being hand-designed. This yields significant improvements upon the state-of-the-art on standard 3D human pose estimation benchmarks.
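To make the idea of a trainable fusion scheme concrete, the following is a minimal sketch, not the paper's actual architecture: two prediction streams for the same 3D joint coordinates, standing in for direct 3D regression and for 3D inferred from 2D joint locations, are combined by a single learned gate that is fitted by gradient descent rather than hand-designed. All variable names and noise levels here are illustrative assumptions.

```python
# Hypothetical toy illustration of a trainable fusion of two pose streams.
import numpy as np

rng = np.random.default_rng(0)

# Toy "ground truth" 3D joint coordinates (17 joints x 3 coords, flattened).
target = rng.normal(size=51)

# Two noisy streams standing in for (a) direct 3D regression and
# (b) 3D coordinates inferred from 2D joint locations.
stream_direct = target + rng.normal(scale=0.5, size=51)
stream_from_2d = target + rng.normal(scale=0.2, size=51)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Single trainable gate parameter; the fusion weight is learned, not fixed.
w = 0.0
lr = 0.5
for _ in range(200):
    a = sigmoid(w)                                    # fusion weight in (0, 1)
    fused = a * stream_direct + (1 - a) * stream_from_2d
    err = fused - target
    # dL/da for L = mean squared error, then chain rule through the sigmoid.
    grad_a = 2.0 * np.mean(err * (stream_direct - stream_from_2d))
    w -= lr * grad_a * a * (1 - a)

alpha = sigmoid(w)
print(alpha)  # learned weight on the direct-regression stream
```

In this toy setup the 2D-derived stream is less noisy, so the learned gate ends up favouring it (alpha well below 0.5), which is the behaviour a hand-designed fixed weight could not adapt to.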